{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 3: groupby and more (Seaborn) plots\n", "\n", "This lab explores the FBI NICS Firearms Background Check data, which records the number of background checks made. A background check is required before *some* firearm sales (private sales are a big exception). This data is often used as the best approximation of total gun sales at a given time.\n", "\n", "BuzzFeed converts the PDF data supplied by the FBI to CSV files.\n", "\n", "For more information on the dataset: [https://github.com/BuzzFeedNews/nics-firearm-background-checks](https://github.com/BuzzFeedNews/nics-firearm-background-checks)\n", "\n", "For a direct link to the dataset (current as of July 2019): [https://raw.githubusercontent.com/BuzzFeedNews/nics-firearm-background-checks/master/data/nics-firearm-background-checks.csv](https://raw.githubusercontent.com/BuzzFeedNews/nics-firearm-background-checks/master/data/nics-firearm-background-checks.csv)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the CSV file into a dataframe called `guns`, and display the dataframe to make sure it was loaded correctly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert the `month` column to the `datetime` type."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There was no day in the original `month` column. What happens to the day once we convert this column into a `datetime` object?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a feel for the data, plot the number of handgun background checks (the `handgun` column) made in New York on the y axis and the date on the x axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do you notice about the plot?\n", "\n", "What was the mean number of handgun background checks?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Groupby\n", "\n", "\n", "What if we wanted to find the mean number of handgun checks for each state? Our usual method of filtering would take a while, because we would have to filter once per state. Instead we will use the *group by* process, which:\n", "- *splits* the data into groups based on some criteria\n", "- *applies* a function to each group independently\n", "- *combines* the results into a data structure\n", "\n", "The splitting step is done by the `groupby()` method; a second method, such as `mean()`, is then applied to each group."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "guns.groupby(\"state\").mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we only want to see the `handgun` column, we can use:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "guns.groupby(\"state\").mean()[\"handgun\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other functions we can use with `groupby()` are:\n", "- `mean()`: compute the mean of each group\n", "- `sum()`: compute the sum of each group\n", "- `size()`: compute the size of each group (number of rows)\n", "- `count()`: count the non-missing values in each group\n", "- `std()`: compute the standard deviation of each group\n", "- `var()`: compute the variance of each group\n", "- `describe()`: generate descriptive statistics for each group\n", "- `min()`: compute the minimum of each group\n", "- `max()`: compute the maximum of each group\n", "\n", "For example, what is the standard deviation of long gun background checks in each state?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the output of `guns.groupby(\"state\").mean()[\"handgun\"]` looks a lot like the output of `value_counts()`: both are Series with one value per category. We can use it to make a bar plot in the same way. Try it below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "guns.groupby(\"state\").mean()[\"handgun\"].plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Answer: `guns.groupby(\"state\").mean()[\"handgun\"].plot.bar()`\n", "\n", "We can also use `groupby` for dates. For example, to sum the numeric columns by calendar month (1 through 12, combining all years):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# numeric_only=True skips the text and datetime columns, which cannot be summed\n", "guns.groupby(guns[\"month\"].dt.month).sum(numeric_only=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which month has the most background checks for long guns? For handguns?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn plotting\n", "\n", "[Seaborn](https://seaborn.pydata.org) is a Python package for creating beautiful plots.\n", "\n", "For example, suppose we want to make a scatter plot but use size and color to add more information to the plot.\n", "\n", "Using pandas, make a scatter plot with the number of handgun background checks on the x axis and the number of long gun background checks on the y axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the same plot in Seaborn, we use the code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.relplot(x=\"handgun\", y=\"long_gun\", data=guns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To color the points by the state:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.relplot(x=\"handgun\", y=\"long_gun\", hue=\"state\", data=guns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot is a little hard to interpret, so let's make a smaller dataframe called `guns5` with only 5 states (whichever 5 you would like)."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To size the circles by the total number of permit checks made that month:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.relplot(x=\"handgun\", y=\"long_gun\", hue=\"state\", size=\"permit\", data=guns5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are some large handgun and long gun background check values. Which state are they from?\n", "\n", "What are the maximum values in the `handgun` and `long_gun` columns?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find a row containing the median handgun value 3280:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "guns.loc[guns[\"handgun\"] == 3280]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now find the rows containing the maximum `handgun` and `long_gun` values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Challenges\n", "\n", "- Make a hexagonal plot of just the Texas `handgun` vs. `long_gun` background check numbers.\n", "- Choose another Seaborn plot from the [gallery](https://seaborn.pydata.org/examples/index.html). Can you make it using the background check data?"
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }